Abstract 

Information network analysis has drawn a lot of attention in recent years. Among all the aspects of network analysis, 
the similarity measure of nodes has been shown to be useful in many applications, such as clustering, link prediction 
and community identification, to name a few. As linkage data in a large network is inherently sparse, collecting 
more data can improve the quality of the similarity measure. This gives different parties a motivation to 
cooperate. In this paper, we address the problem of link-based similarity measure of nodes in an information network 
distributed over different parties. To address data privacy, we propose a privacy-preserving SimRank protocol 
based on fully-homomorphic encryption to provide cryptographic protection for the links. 

Index Terms 

Privacy; Similarity; 

I. Introduction 

Real network data usually consist of objects of different types, forming a so-called heterogeneous network. A 
heterogeneous network contains an abundance of information, which gives rise to great interest in analyses such as 
clustering, classification, centrality measure, object ranking and pattern mining [4, 7, 14], to name a few. Among all 
the aspects of heterogeneous network analysis, measuring the similarity between nodes is one of the most important 
problems. Answering how similar nodes are is essential in many applications, such as clustering, link prediction 
and community identification [1, 5, 12]. 

According to the evaluation of [13], link-based similarity measures [6, 9, 10] conform well with human judgement. 
In the work of [10], the similarity of two nodes can be understood as a weighted count of the number of all-length 
paths between the nodes. In [9], the similarity can be regarded as the probability that a node A reaches a node B in 
a restricted random walk manner. Both the measures proposed in [10] and [9] are applied on undirected networks. 

Fig. 1. The rental data of on-line movie stores A, B and C. 

SimRank [6] defines the similarity score of two nodes A and B as the expected meeting distance, i.e., the expected 
distance that two random surfers, starting at node A and node B respectively, travel before they meet at the same 
node. Among all the link-based similarity measures, SimRank [6] is one of the most influential approaches [11] 
for the following reasons. First, the SimRank similarity score is independent of the clustering approach, and thus can be 
applied to various kinds of clustering methods. Second, SimRank can be applied on both directed and undirected 
graphs, which gives great flexibility in analyzing heterogeneous networks. Furthermore, SimRank can work on any 
graph that has object-to-object relations, without the need for domain knowledge. 

In reality, however, data are often distributed among different parties. For example, Figure 1(a) shows the rental 
data in three on-line movie stores. Consider that an on-line movie store wants to build a recommendation system for 
its customers. Based on the rental records, SimRank can provide a good similarity measure for clustering the movies 
of similar types. However, the clustering result may still not be good enough since the amount of rental records 
is also a dominant factor. Combining the rental data from different stores for clustering, as shown in Figure 1(b), 
surely can improve the performance of movie recommendation systems. This gives the on-line stores a motivation 
to cooperate. 

Whenever different parties cooperate, data privacy is an important concern. In the above example, the on-line 
movie stores may not be willing to reveal their network data to each other since the links indicating the rental 
information, i.e., customer-rent-movie, are usually one of the most valuable assets for a company. Revealing the 
rental information of a company may allow its competitor to be able to analyze its customer behavior and use 
the information against the company. Therefore, how to cooperatively compute SimRank similarity scores without 
exposing any sensitive information, i.e., links between objects, becomes a crucial problem. To the best of our 
knowledge, no study has yet addressed this problem. 

It is challenging, under the constraint of link protection, to calculate the link-based similarity of nodes in a 
network distributed over different parties. As the similarity calculation is based on the link information of the entire 
network, it is essential to securely integrate the distributed data into a synthesized graph. For example, according 
to the similarity definition of SimRank, two nodes are seen as similar if the neighbor nodes they connect to are 
themselves similar to each other. As a result, the entire graph structure needs to be considered when computing 
similarity scores between nodes. However, synthesizing graphs without exchanging any link information seems 
daunting. Without knowledge of the entire graph structure, we cannot even correctly identify the neighbors of a 
certain node. How to build a combined graph that gives access to the information we need while avoiding revealing each 
party's link information is the issue to be solved. 

Specifically, in this paper we take the link-based similarity measure defined by SimRank [6] and address the 
problem of privacy-preserving similarity measure in a joint bipartite network consisting of data from two or more 
parties. This problem poses two challenges: (1) how each party can guarantee its link protection in a way 
that allows the construction of a structure containing the information involved in the similarity measure, and (2) how 
the similarity is calculated on such a structure. To tackle these challenges, we propose a new protocol called 
privacy-preserving SimRank (PP-SimRank) based on fully-homomorphic encryption (FHE) [2, 3]. FHE takes only 0 or 1 as 
input, outputs different ciphertexts with random noise introduced, and has both the addition and multiplication 
homomorphic properties. Therefore, the XOR- and AND-operations on plaintexts are equivalent to the addition 
and multiplication on the corresponding ciphertexts. PP-SimRank utilizes these properties to construct virtual 
joint networks in the ciphertext field. By extending the similarity definition to the virtual networks, PP-SimRank is able 
to calculate SimRank scores without knowing the link information in the bipartite network of each party. We show 
that PP-SimRank can protect the link information in a semi-honest model, where a semi-honest security model 
restricts all parties to faithfully follow their specified protocol, but allows the parties to record all the intermediate 
messages for analysis. In addition, to overcome the constraint of FHE taking only 0 or 1 as input, PP-SimRank 
converts the SimRank scores into binary representations and carries out the score computation in the ciphertext field 
with binary arithmetic operations. In this way, we demonstrate an application of FHE in practice. 

Generally, the contributions of this paper include: 

1) We are the first to address the problem of link-based similarity measure co-computation over a distributed 
information network. 

2) We propose a privacy-preserving protocol, PP-SimRank, to securely compute the SimRank similarity score over 
distributed data. PP-SimRank is the first privacy-preserving protocol that focuses on link-based similarity 
measure co-computation. 

3) We implement the basic arithmetic operations, i.e., addition, subtraction, multiplication and 
division, in the ciphertext field under fully-homomorphic encryption, demonstrate an application of the 
fully-homomorphic encryption scheme and show its potential in privacy-preserving data mining. 

The rest of this paper is organized as follows. Section II formally defines our problem and provides the background 
knowledge for our work. Section III describes our Privacy-Preserving SimRank protocol in detail, and Section IV 
discusses the implementation issues. We then analyze the complexity and security of the proposed PP-SimRank in 
Section V, extend PP-SimRank to the multi-party scenario in Section VI, and conclude this paper in Section VII. 

II. Preliminary 

In this section, we formally define the problem and provide the necessary background. We first formulate the 
problem in Section II-A, and give a brief review of SimRank in Section II-B. 

A. Problem Formulation 

In this paper, we address the problem of privacy-preserving computation of a link-based similarity measure over 
a horizontally partitioned information network held by different parties. Our target is to protect the link information and 
achieve the computation of the SimRank score simultaneously. For simplicity, we first focus on the two-party scenario, and 
then show the extension to general cases. 

In a two-party scenario, two parties called Alice and Bob hold information networks $G^A = (V^A, U, E^A)$ and 
$G^B = (V^B, U, E^B)$, respectively. Without loss of generality, we assume that $V^A$ and $V^B$ are disjoint vertex sets, 
and $U$ is a vertex set known to both Alice and Bob. The parties wish to cooperatively learn the SimRank similarity 
scores $S(u_i, u_j)$ of all vertex pairs $u_i, u_j \in U$ using their joint data, i.e., $G = (V^A \cup V^B, U, E^A \cup E^B)$. For 
privacy reasons, the problem is how to construct the virtual information network $G$ without revealing $E^A$ and $E^B$, 
and how to perform the SimRank computation on the virtual information network $G$. 

For example, consider that Alice and Bob are two on-line movie stores. Alice and Bob have different customer 
groups, $V^A$ and $V^B$, respectively, and the same movie set $U$. A rental record is then a link connecting $v_i$ with 
$u_j$. When both Alice and Bob regard their own rental records as sensitive data, the problem is how Alice 
and Bob can construct a virtual graph $G = (V^A \cup V^B, U, E^A \cup E^B)$ without revealing $E^A = \{(v^A_i, u_j)\}$ and 
$E^B = \{(v^B_i, u_j)\}$ to each other, and learn the SimRank similarity score $S(u_i, u_j)$ of all the movie pairs to improve 
their movie recommendation systems. 

For this problem, ideally, Alice and Bob would find a trusted third party to perform all the needed calculations 
for them. That is, Alice and Bob would securely transmit their data to this trusted third party, which would then 
compute the movies' pairwise similarity scores using SimRank and send the results back to both Alice 
and Bob. However, such independent third parties are very hard to find or do not exist in the real world. 

Instead of counting on a third party, we propose a privacy-preserving protocol, PP-SimRank, based on 
fully-homomorphic encryption (FHE) [2, 3] to protect the link information. The fully-homomorphic encryption method 
takes plaintexts in binary form, i.e., {0, 1}, as inputs and outputs ciphertexts in the form of big integers. Given 
the same plaintext and the same public key for encryption, FHE will output different ciphertexts (big integers) with 
random noise, i.e., the probability that two encryptions of the same bit coincide, e.g., Enc(1) = Enc(1), is negligibly 
small [3, 8]. Therefore, FHE is semantically secure against chosen plaintext attacks and is suitable for link encryption. 
In addition, FHE has the following two properties. 

PROPERTY 1. Addition-homomorphic: 

$$c_1 + c_2 = Enc(m_1) + Enc(m_2) = Enc(m_1 \oplus m_2), \tag{1}$$

PROPERTY 2. Multiplication-homomorphic: 

$$c_1 \times c_2 = Enc(m_1) \times Enc(m_2) = Enc(m_1 \wedge m_2), \tag{2}$$

where the two ciphertexts $c_1 = Enc(m_1)$ and $c_2 = Enc(m_2)$ represent the encryption outcomes of the plaintexts 
$m_1$ and $m_2$, respectively. Accordingly, the addition and multiplication operations on the ciphertexts are equivalent to 
the XOR- and AND-operations on the corresponding plaintexts. We can, therefore, carry out arithmetic operations 
such as addition, subtraction, multiplication and division in the ciphertext field, using the homomorphic properties 
of FHE [2, 3]. By restricting the computation of the SimRank similarity score to the ciphertext field, we can achieve 
the protection of link information. 
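
To make Properties 1 and 2 concrete, the following Python sketch uses a toy integer-based scheme as a stand-in. It is not the lattice-based FHE of [2, 3]: the construction (a bit hidden as `p*q + 2*r + m`), the function names `keygen`, `enc`, `dec` and the parameter sizes are illustrative assumptions, the scheme is symmetric rather than public-key, and it is not secure. It only shows that adding and multiplying ciphertexts act as XOR and AND on the hidden bits while the noise stays small.

```python
import secrets

# Toy stand-in, NOT the FHE scheme of [2, 3]: a bit m is hidden as
# c = p*q + 2*r + m, with secret odd key p, random multiplier q, small noise r.
P_BITS, Q_BITS, R_BITS = 256, 128, 8   # hypothetical, insecure parameters


def keygen() -> int:
    """Return a random odd secret key p with its top bit set."""
    return secrets.randbits(P_BITS) | (1 << (P_BITS - 1)) | 1


def enc(p: int, m: int) -> int:
    """Encrypt one bit; fresh randomness gives a different ciphertext each call."""
    assert m in (0, 1)
    return p * secrets.randbits(Q_BITS) + 2 * secrets.randbits(R_BITS) + m


def dec(p: int, c: int) -> int:
    """Recover the bit as (c mod p) mod 2; valid while the noise stays below p."""
    return (c % p) % 2


p = keygen()
for m1 in (0, 1):
    for m2 in (0, 1):
        c1, c2 = enc(p, m1), enc(p, m2)
        assert dec(p, c1 + c2) == m1 ^ m2   # Property 1: addition acts as XOR
        assert dec(p, c1 * c2) == m1 & m2   # Property 2: multiplication acts as AND
```

In this toy scheme the noise grows with every multiplication, so only shallow circuits decrypt correctly; this is the role that the re-encryption step of a real FHE scheme plays.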

Further, we will show that PP-SimRank keeps the link information hidden in a semi-honest security 
model. 



B. SimRank Overview 

SimRank [6] is a similarity measure based on the linkage information in a network. Two objects are regarded 
as similar if they are related to similar objects. Specifically, let $S(v_i, v_j)$ denote the similarity between two 
nodes $v_i$ and $v_j$ in a network $G = (V, E)$. The iterative similarity computation equations of SimRank are as follows: 

$$S(v_i, v_j) = \frac{d_{out}}{|O(v_i)||O(v_j)|} \sum_{p=1}^{|O(v_i)|} \sum_{q=1}^{|O(v_j)|} S(O_p(v_i), O_q(v_j)), \tag{3}$$

$$S(u_i, u_j) = \frac{d_{in}}{|I(u_i)||I(u_j)|} \sum_{p=1}^{|I(u_i)|} \sum_{q=1}^{|I(u_j)|} S(I_p(u_i), I_q(u_j)), \tag{4}$$

where $O(v_i)$ or $O(v_j)$ denotes the set of out-neighbors of node $v_i$ or $v_j$, $I(u_i)$ or $I(u_j)$ denotes the set of in-neighbors 
of node $u_i$ or $u_j$, and $d_{out}$ ($d_{in}$) is the decay factor. 
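
As a plaintext reference point for Equations (3) and (4), the sketch below iterates the two updates on a small bipartite graph. The function name `bipartite_simrank`, the decay values, the iteration count and the base cases ($S(x, x) = 1$, and a score of 0 when either node has no neighbors) are assumptions for illustration; the base cases follow the usual SimRank convention and are not stated in this section.

```python
from itertools import combinations


def bipartite_simrank(V, U, edges, d_out=0.8, d_in=0.8, iterations=5):
    """Iterate Equations (3) and (4) on a bipartite graph with edges (v, u)."""
    out_nb = {v: [u for (a, u) in edges if a == v] for v in V}   # O(v)
    in_nb = {u: [v for (v, b) in edges if b == u] for u in U}    # I(u)

    # Assumed base cases: S(x, x) = 1 and all other scores start at 0.
    s_v = {(a, b): float(a == b) for a in V for b in V}
    s_u = {(a, b): float(a == b) for a in U for b in U}

    def step(nodes, nbrs, other, decay):
        new = {(a, a): 1.0 for a in nodes}
        for a, b in combinations(nodes, 2):
            na, nb = nbrs[a], nbrs[b]
            total = sum(other[(x, y)] for x in na for y in nb)
            score = decay * total / (len(na) * len(nb)) if na and nb else 0.0
            new[(a, b)] = new[(b, a)] = score
        return new

    for _ in range(iterations):
        # Both updates read the scores of the previous iteration.
        s_v, s_u = step(V, out_nb, s_u, d_out), step(U, in_nb, s_v, d_in)
    return s_v, s_u


V = ["renter1", "renter2"]
U = ["movie1", "movie2", "movie3"]
E = [("renter1", "movie1"), ("renter1", "movie2"), ("renter2", "movie3")]
s_v, s_u = bipartite_simrank(V, U, E)
print(s_u[("movie1", "movie2")])   # the two movies share their only renter
```

Here movie1 and movie2 are rented by the same single renter, so their score equals the decay factor, while movie3 shares no renter with them and stays at 0.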

To simplify the SimRank score computation, SimRank derives a node-pair graph $G^2 = (V^2, E^2)$. Each node $w$ 
in $G^2$ is associated with a node pair $\langle a, b \rangle$ in $G$, i.e., $a, b \in V$. The presence of an edge between two nodes 
$\langle a, b \rangle$ and $\langle c, d \rangle$ in $G^2$ represents that there exist edges $(a, c)$ and $(b, d)$ in $G$, or that there exist edges $(a, d)$ and 
$(b, c)$ in $G$. In $G^2$, each node $w$ is associated with a SimRank score giving a measure of similarity between the 
two nodes of $G$ represented by $w$. The neighbors of $w$ are the nodes whose similarity needs to be considered 
when the SimRank score of $w$ is computed. 

Figure 2(a) shows an example of the joint rental data of two on-line movie stores Alice and Bob. Figure 2(b) 
is the corresponding node-pair graph $G^2$, where the isolated nodes are omitted. There is an edge in $G^2$ between 
the nodes $\langle$Movie1, Movie2$\rangle$ and $\langle$Renter1, Renter2$\rangle$, because Renter 1 rents Movie 1 and Renter 2 rents 
Movie 2 in $G$. 

Fig. 2. The combined renter data of online movie stores Alice and Bob. 

Fig. 3. Privacy-preserving SimRank protocol. 

In the SimRank method, the node-pair graph $G^2$ is proposed to make explicit the propagation of similarity 
scores between pairs of nodes. Later, we will further incorporate the node-pair graph $G^2$ into the definition of 
similarity so that PP-SimRank is able to calculate the SimRank score in the ciphertext field. 

III. Privacy-preserving SimRank Protocol 

A. Protocol Overview 

We now introduce a new protocol called Privacy-Preserving SimRank (PP-SimRank) based on fully-homomorphic 
encryption (FHE). In this protocol, there are two essential roles: cryptographer and calculator. The cryptographer 
determines the public-secret key pair (pk, sk) used for encryption/decryption of data, shares the public key with 
the others while keeping the secret key strictly to himself, and performs the decryption of the encrypted SimRank scores. 
The calculator collects encrypted data from all parties, constructs the virtual network in the ciphertext field, and performs 
the SimRank score calculation on the virtual network. As the cryptographer does not have the virtual network and 
the calculator does not have the secret key for decryption, the SimRank score computation over a distributed information 
network can be achieved while the link information of each party's data is protected. 

Figure 3 shows the procedure of the PP-SimRank protocol. First, one party, called Alice, determines the public-secret 
key pair (pk, sk), and sends the public key pk to the other party, called Bob. After that, both Alice and Bob encode 
and encrypt their data with pk. Alice then transmits her encrypted data to Bob so that Bob 
holds a joint virtual information network $G = (V^A \cup V^B, U, E^A \cup E^B)$ in the ciphertext field. Based on the joint 
network $G$, Bob further builds a node-pair graph $G^2$. Later, by incorporating the virtual node-pair network $G^2$ into 
the definition of SimRank similarity, Bob performs the SimRank calculation in the ciphertext field. Finally, Bob adds 
a random number to each of the final SimRank scores of the vertex pairs in $U^2$, and sends the results to Alice for 
decryption. Both Alice and Bob can then learn the SimRank similarity scores of the vertex pairs in $U^2$ after they 
exchange the decrypted scores and the corresponding added random numbers. 

In this protocol, however, several challenging problems need to be solved. First, the bipartite network 
of each party has to be encoded and encrypted in a way that allows the construction of the joint network $G$ as well 
as the corresponding node-pair graph $G^2$ while link protection is guaranteed. Second, based on the node-pair 
graph built in the ciphertext field, the whole process of SimRank similarity computation has to be restricted to the 
ciphertext field as well, since the FHE homomorphic properties hold only if the plaintexts are encrypted with the 
same key. Moreover, the problem is made even more challenging by the fact that FHE takes only 0 or 1 
as input while the similarity score is a real number in [0, 1]. We will tackle these challenges step by step in the 
following subsections. 

In Section III-B, we explain how Alice and Bob encrypt their data so that Bob is able to construct the graph 
matrix of the node-pair graph $G^2$ in the ciphertext field. Section III-C shows how exactly the graph matrix of $G^2$ 
is constructed. Based on $G^2$, Section III-D gives the details of how the SimRank score computation is achieved 
when the computation is restricted to the ciphertext field. 

B. Encoding and encryption of links 

We now explain how the parties encode and encrypt their data in detail. Without loss of generality, we assume 
that both Alice and Bob know the total number of nodes in both $U$ and $(V^A \cup V^B)$, i.e., $m = |V^A| + |V^B|$ and 
$n = |U|$, and that all the nodes are ordered. 

Note that the bipartite graph in our problem is not a weighted one. The direct relationship between two nodes 
is either connection or disconnection, i.e., a link $e \in \{0, 1\}$. Many encryption schemes output the 
same ciphertext given the same input and encryption key. With only two possible plaintexts, such schemes cannot 
guarantee link protection even though the data is encrypted. 

To tackle this challenge, we propose PP-SimRank based on FHE [2, 3], which takes only 0 or 1 as possible input 
and outputs different ciphertexts for a given input bit by introducing random noise in the encryption. That is, after 
encryption, the ciphertext representing the same bit varies. To hide the link information more effectively, we let the 
parties encode not only the connections but also the disconnections between nodes into bit vectors and encrypt the 
bit vectors using FHE. Bob, who receives Alice's data, thus cannot distinguish the ciphertexts corresponding 
to the 1-bits from those corresponding to the 0-bits, and the link information is effectively protected. On the other hand, 
as the link information is encoded into bits independently and encrypted without other artificial noise (in addition to the 
random noise introduced by FHE), Bob is able to utilize the homomorphic properties to construct the virtual graphs $G$ and $G^2$. 

Fig. 4. Two cases that lead to an edge between $\langle v_{x_1}, v_{x_2} \rangle$ and $\langle u_{y_1}, u_{y_2} \rangle$. 

Specifically, for each node $v_i$ in $V^A$ (or $V^B$), Alice (or Bob) will generate a bit vector $E^A_{v_i}$ (or $E^B_{v_i}$), where 
$E^A_{v_i}, E^B_{v_i} \in \{1, 0\}^n$, with respect to the connections and disconnections of $v_i$ to all the nodes in $U$ based on her 
(his) own data. Each element $e_{i,j}$ in $E^A_{v_i}$ (or $E^B_{v_i}$) is 1 if there is a link in $G^A$ (or $G^B$) connecting $v_i$ and $u_j \in U$, 
or 0 otherwise. After encoding all the link information into bit vectors, the parties encrypt every element $e_{i,j}$ in 
the bit vectors $E^A_{v_i}$ and $E^B_{v_i}$ to get the cipher vectors $C^A_{v_i}$ and $C^B_{v_i}$, where each element $c_{i,j} = Enc(e_{i,j})$, using the 
same public key pk (determined by Alice in the previous step). According to the homomorphic properties of FHE, 
Bob can thus collect the cipher vectors $C^A_{v_i}$ and $C^B_{v_i}$ to construct a virtual joint bipartite graph in the ciphertext 
field. 
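
The encoding step described above can be sketched in a few lines of Python. The helper name `encode_links` and the sample node labels are hypothetical; the point is that every node gets a vector of length $n = |U|$ that records disconnections (0) as well as connections (1), so that each position would later be encrypted as one ciphertext.

```python
def encode_links(own_nodes, U, edges):
    """Return one 0/1 vector of length |U| per node of a party's own vertex set."""
    edge_set = set(edges)
    return {v: [1 if (v, u) in edge_set else 0 for u in U] for v in own_nodes}


# Hypothetical data: U is the ordered movie set known to both parties.
U = ["movie1", "movie2", "movie3"]
alice_edges = [("a1", "movie1"), ("a1", "movie3"), ("a2", "movie2")]
alice_vectors = encode_links(["a1", "a2"], U, alice_edges)
print(alice_vectors)
```

Because every vector has the same length regardless of how many links a node has, the number of ciphertexts sent to Bob reveals nothing about a node's degree.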

C. Construction of Virtual Graphs 

This subsection shows how the joint virtual network $G^2$, where $G = (V^A \cup V^B, U, E^A \cup E^B)$, is constructed in 
the ciphertext field. 

To construct the virtual network $G^2$, knowing the link information is essential. However, since the link information 
is hidden in the cipher vectors $C^A_{v_i}$ and $C^B_{v_i}$, it is impossible for Bob to construct a graph with explicit link 
information. Instead, we let Bob construct the virtual graph $G^2$ by filling the graph matrix $M$ of $G^2$ in the ciphertext 
field. Each cipher element of $M$ can be derived by applying the homomorphic operations on the cipher vectors $C^A_{v_i}$ 
and $C^B_{v_i}$ as follows. 

Specifically, let $w_x = \langle v_{x_1}, v_{x_2} \rangle$ denote a node-pair in $G^2$, where each $v$ is a node in $(V^A \cup V^B)$, and let 
$w_y = \langle u_{y_1}, u_{y_2} \rangle$ denote a node-pair of $U$ in $G^2$. Recall that, in $G^2$, there is an edge $e_{x,y}$ connecting the nodes 
$w_x = \langle v_{x_1}, v_{x_2} \rangle$ and $w_y = \langle u_{y_1}, u_{y_2} \rangle$ if and only if (as shown in Figure 4), in $G$, at least one of the 
following two conditions is satisfied. 

1) $\{e_{x_1,y_1}, e_{x_2,y_2}\} \subseteq (E^A \cup E^B)$: $u_{y_1}$ is an out-neighbor of $v_{x_1}$ and $u_{y_2}$ is an out-neighbor of $v_{x_2}$. 

2) $\{e_{x_1,y_2}, e_{x_2,y_1}\} \subseteq (E^A \cup E^B)$: $u_{y_2}$ is an out-neighbor of $v_{x_1}$ and $u_{y_1}$ is an out-neighbor of $v_{x_2}$. 

That is, the value of $e_{x,y}$ can be derived by the following formula: 

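$$e_{x,y} = \mathrm{case1} \vee \mathrm{case2}, \tag{5}$$

where 

$$\mathrm{case1} = e_{x_1,y_1} \wedge e_{x_2,y_2},$$
$$\mathrm{case2} = e_{x_1,y_2} \wedge e_{x_2,y_1}.$$

When the OR-operation is carried out by XOR- and AND-operations, Equation (5) can be rewritten as follows 
based on Figure 5. 

$$e_{x,y} = (\mathrm{case1} \oplus \mathrm{case2}) \oplus (\mathrm{case1} \wedge \mathrm{case2}). \tag{6}$$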



Now consider this problem in the ciphertext field and let $\pi_{x,y}$ be the corresponding cipher of $e_{x,y}$ in $M$. Due to the 
homomorphic properties of FHE, the XOR-/AND-operation on the plaintext is equivalent to the addition/multiplication 
operation on the corresponding ciphertext. The value of $\pi_{x,y}$ can thus be derived as follows. 

$$\pi_{x,y} = Enc_{pk}(e_{x,y}) = (Enc_{pk}(\mathrm{case1}) + Enc_{pk}(\mathrm{case2})) + (Enc_{pk}(\mathrm{case1}) \times Enc_{pk}(\mathrm{case2})), \tag{7}$$

where 

$$Enc_{pk}(\mathrm{case1}) = c_{x_1,y_1} \times c_{x_2,y_2},$$
$$Enc_{pk}(\mathrm{case2}) = c_{x_1,y_2} \times c_{x_2,y_1}.$$

Consequently, Bob can construct a virtual graph $G^2$ for subsequent SimRank score computation, while the link 
information is protected and hidden in ciphers. 
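
The construction of the graph matrix can be checked on plaintext bits, since the ciphertext addition and multiplication in Equation (7) mirror the XOR and AND of Equation (6). The Python sketch below works on unencrypted bit vectors only; the names `pair_edge` and `graph_matrix`, the restriction to pairs of distinct nodes and the sample vectors are assumptions made for illustration.

```python
from itertools import combinations


def pair_edge(e_x1, e_x2, y1, y2):
    """Equation (6): OR of the two cases written with XOR (^) and AND (&) only."""
    case1 = e_x1[y1] & e_x2[y2]
    case2 = e_x1[y2] & e_x2[y1]
    return (case1 ^ case2) ^ (case1 & case2)


def graph_matrix(bit_vectors, n):
    """Map every (node-pair of V, node-pair of U) combination to its edge bit."""
    v_pairs = list(combinations(sorted(bit_vectors), 2))
    u_pairs = list(combinations(range(n), 2))
    return {
        (wx, wy): pair_edge(bit_vectors[wx[0]], bit_vectors[wx[1]], wy[0], wy[1])
        for wx in v_pairs
        for wy in u_pairs
    }


# Hypothetical bit vectors of one node from each party over three movies.
vectors = {"a1": [1, 0, 1], "b1": [0, 1, 0]}
M = graph_matrix(vectors, n=3)
for key, bit in sorted(M.items()):
    print(key, bit)
```

In the protocol itself, `&` and `^` would be replaced by ciphertext multiplication and addition, so Bob fills the matrix without ever seeing a plaintext bit.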
D. Computation of SimRank score 

In this subsection, we explain in detail how Bob computes the SimRank score in the ciphertext field, based 
on the virtual network $G^2$. As Equations (3) and (4) show, the SimRank similarity of two nodes $v_{x_1}$ and $v_{x_2}$ is 
defined as the normalized sum of the similarities between their neighbors. In the ciphertext field, however, the 
neighborhood information as well as the link information is protected and hidden in the encrypted graph matrix $M$ 
of $G^2$. To compute the SimRank score, we therefore need to incorporate the graph matrix $M$ into Equations (3) and 
(4), and express the formulas correspondingly in the ciphertext field. 

First, consider incorporating the graph matrix $M$ into Equations (3) and (4). For $v \in (V^A \cup V^B)$ and $u \in U$, 
let $w_x = \langle v_{x_1}, v_{x_2} \rangle$ and $w_y = \langle u_{y_1}, u_{y_2} \rangle$ denote the node-pairs in $G^2$, and let $e_{x,y}$ be the element in $M$ that 
represents the connection/disconnection between the node-pairs $w_x$ and $w_y$. The SimRank scores associated with 
$w_x$ and $w_y$ are: 

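$$S(w_x) = S(v_{x_1}, v_{x_2}) = \frac{d_{out}}{|O(v_{x_1})||O(v_{x_2})|} \sum_{i=1}^{|O(v_{x_1})|} \sum_{j=1}^{|O(v_{x_2})|} S(O_i(v_{x_1}), O_j(v_{x_2})) = \frac{d_{out}}{\sum_{i=1}^{n} e_{x_1,i} \sum_{j=1}^{n} e_{x_2,j}} \sum_{y} S(w_y) \times e_{x,y}, \tag{8}$$

$$S(w_y) = S(u_{y_1}, u_{y_2}) = \frac{d_{in}}{|I(u_{y_1})||I(u_{y_2})|} \sum_{i=1}^{|I(u_{y_1})|} \sum_{j=1}^{|I(u_{y_2})|} S(I_i(u_{y_1}), I_j(u_{y_2})) = \frac{d_{in}}{\sum_{i=1}^{m} e_{i,y_1} \sum_{j=1}^{m} e_{j,y_2}} \sum_{x} S(w_x) \times e_{x,y}. \tag{9}$$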

In Equation (8), $\sum_{i=1}^{n} e_{x_1,i}$ ($\sum_{j=1}^{n} e_{x_2,j}$) indicates the number of out-neighbors of $v_{x_1}$ ($v_{x_2}$), and $S(w_y)$ is the 
SimRank similarity between a potential neighbor of $v_{x_1}$ and a potential neighbor of $v_{x_2}$. Whether a score $S(w_y)$ is 
counted in $S(w_x)$ depends on the value of $e_{x,y} \in \{0, 1\}$ in the graph matrix $M$ of $G^2$. 

Now we show the corresponding calculation of the SimRank score in the ciphertext field. The challenge is that both the 
network $G$ and the corresponding node-pair graph $G^2$ are built in ciphers. We therefore need to encrypt the SimRank 
similarity score using the same public key pk, since the FHE homomorphic properties hold only when the score is 
encrypted in the same field as the link information. 

Specifically, note that the FHE method takes only binary inputs, i.e., {0, 1}, while the similarity 
score $S(w_x)$ is a real number in [0, 1]. To allow the encryption of $S(w_x)$ using FHE, we encode the real number 
$S(w_x) \in [0, 1]$ (as well as $d_{in}$ and $d_{out}$) into a binary representation, i.e., a bit string, and encrypt the bit string into 
a cipher string. The arithmetic operations on the real numbers then become binary arithmetic operations on 
the (bit/cipher) strings. Let $\boxtimes$, $\boxed{\div}$, and $\boxed{\Sigma}$ denote the multiplication, division and sum binary arithmetic operations, 
respectively. We discuss the implementations of the binary arithmetic operations in Section IV-A, and convert 
Equations (8) and (9) into the following formulas in the ciphertext field. 



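$$Q(w_x) = Enc_{pk}(S(w_x)) = \Delta \boxtimes \boxed{\Sigma}\, (Q(w_y) \boxtimes \pi_{x,y}), \tag{10}$$

$$Q(w_y) = Enc_{pk}(S(w_y)) = \square \boxtimes \boxed{\Sigma}\, (Q(w_x) \boxtimes \pi_{x,y}), \tag{11}$$

where $\boxtimes$ and $\boxed{\Sigma}$ are the binary multiplication and sum on cipher strings, and the leading coefficients $\Delta$ and 
$\square$ correspond to the normalization factors of Equations (8) and (9), respectively, evaluated in the ciphertext field. 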



According to Equations (7), (10) and (11), Bob is able to perform the whole process of SimRank similarity 
calculation in ciphers without knowing the explicit link information of Alice's data. PP-SimRank, therefore, achieves 
a secure similarity measure on a joint network consisting of different parties' data. 
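
The binary representation of the scores is not fixed in this section, so the following Python sketch assumes one possible choice: an unsigned fixed-point format with one integer bit and $l - 1$ fractional bits, which can represent both 0 and 1 exactly. The names `encode_score` and `decode_score` and the default $l = 8$ are hypothetical.

```python
def encode_score(score, l=8):
    """Encode a score in [0, 1] as l bits (most significant first).

    Assumed format: 1 integer bit followed by l - 1 fractional bits.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    k = round(score * (1 << (l - 1)))
    return [(k >> i) & 1 for i in range(l - 1, -1, -1)]


def decode_score(bits):
    """Inverse of encode_score: rebuild the integer and rescale it."""
    k = 0
    for bit in bits:
        k = (k << 1) | bit
    return k / (1 << (len(bits) - 1))


bits = encode_score(0.8)
print(bits, decode_score(bits))
```

Each bit of the resulting string would then be encrypted separately, giving the cipher string $Q(w)$; the rounding error of this format is at most $2^{-l}$.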

IV. Implementation Issues 

A. Arithmetic operations in the ciphertext field 

As mentioned in Section III-D, we implement the binary arithmetic operations to perform the addition, 
multiplication and division on the cipher strings representing $Q(w)$. First, we implement two virtual circuit functions, a 
full-adder function and a full-subtractor function, that simulate the functionality of a real full-adder and 
full-subtractor circuit, respectively, using the homomorphic properties of FHE. By combining a sequence of 
full-adder (or full-subtractor) functions, we build a binary addition $\boxplus$ (or subtraction $\boxminus$) function that can perform 
the addition (or subtraction) of two cipher strings. After that, the multiplication function $\boxtimes$ can be developed by 
calculating the partial products of the input cipher strings, shifting each of the products to the left, and adding them 
together using the addition function $\boxplus$. Similarly, the division function consists of the subtraction function $\boxminus$ and the 
shifting mechanism. We explain the details in the following. 

Fig. 6. Logical circuits of (a) a full-adder and (b) a full-subtractor. 



Binary addition $\boxplus$. As shown in Figure 6(a), in electronics, a real full-adder is composed of a number of 
logical operations. The real full-adder adds the operand bits $a$ and $b$ with the carry-in bit $carry_{in}$ received from 
the previous full-adder, and outputs the answer bit $sum$ and the carry-out bit $carry_{out}$ as the result. The operations of 
a real full-adder can be expressed in terms of XOR and AND as follows: 

$$sum = a \oplus b \oplus carry_{in},$$
$$carry_{out} = (a \wedge b) \oplus (carry_{in} \wedge (a \oplus b)). \tag{12}$$

As these operations work at the bit level, the same as the inputs of FHE, we can convert the real full-adder into 
the ciphertext field using the homomorphic properties. By replacing the XOR- and AND-operations with addition 
and multiplication, respectively, Equation (12) is transformed into the following formulas: 

$$c_{sum} = c_a + c_b + c_{carry_{in}},$$
$$c_{carry_{out}} = (c_a \times c_b) + (c_{carry_{in}} \times (c_a + c_b)), \tag{13}$$

where $c_{sum}$, $c_a$, $c_b$ and $c_{carry}$ denote the corresponding ciphers of $sum$, $a$, $b$ and $carry$, respectively. Based on 
Equation (13), we then develop a function, called the virtual full-adder function, which executes the functionality of the 
real full-adder circuit in the ciphertext field. After that, let each similarity score $S(w)$ in PP-SimRank use $l$ bits for its 
binary representation. By connecting $l$ virtual full-adder functions together, i.e., each function takes the output cipher 
$c_{carry_{out}}$ of the previous full-adder function as its input carry-in cipher $c_{carry_{in}}$, we obtain a binary addition 
function $\boxplus$ to add cipher strings of $Q(w)$ in the ciphertext field. 
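
The chaining of full-adders can be illustrated on plaintext bits, where Python's `^` and `&` play the role that ciphertext addition and multiplication play in Equation (13). This is a sketch under that substitution, not the encrypted implementation; the helper names and the least-significant-bit-first ordering are assumptions.

```python
def full_adder(a, b, carry_in):
    """Equation (12): one-bit full adder built from XOR (^) and AND (&) only."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) ^ (carry_in & (a ^ b))
    return s, carry_out


def binary_add(x_bits, y_bits):
    """Chain full adders over two equal-length bit lists (least significant first)."""
    result, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)   # carry-out feeds the next adder
        result.append(s)
    return result, carry


def to_bits(value, l):
    """Little-endian l-bit representation of a non-negative integer."""
    return [(value >> i) & 1 for i in range(l)]


def from_bits(bits):
    """Integer value of a little-endian bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


total, carry = binary_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(total), carry)   # 11 + 6 = 17 = 1 (sum bits) + 16 (carry)
```

The loop contains no data-dependent branch, which is what allows the same circuit to run on ciphertexts.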

Binary multiplication $\boxtimes$. We further utilize the binary addition function $\boxplus$ to implement the binary multiplication 
function $\boxtimes$. Specifically, similar to the usual way of multiplying two decimal numbers, 
the binary multiplication function $\boxtimes$ can be carried out by computing a sequence of partial products, shifting the 
resulting partial products to the left, and finally applying the binary addition function $\boxplus$ to add the products together. 
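
A plaintext sketch of this shift-and-add procedure is given below, again using `^` and `&` in place of ciphertext addition and multiplication. The function names, the little-endian ordering and the choice to return a $2l$-bit result are assumptions; in the ciphertext field the zero padding would have to be ciphertexts of 0.

```python
def add_bits(x, y):
    """Ripple-carry addition of equal-length little-endian bit lists."""
    out, carry = [], 0
    for a, b in zip(x, y):
        out.append(a ^ b ^ carry)
        carry = (a & b) ^ (carry & (a ^ b))
    return out   # final carry dropped; callers size the lists to avoid overflow


def binary_multiply(x, y):
    """Shift-and-add product of two l-bit little-endian lists, on 2l bits."""
    l = len(x)
    acc = [0] * (2 * l)
    for shift, y_bit in enumerate(y):
        # Partial product: AND every bit of x with one bit of y, shifted left.
        partial = [0] * shift + [a & y_bit for a in x] + [0] * (l - shift)
        acc = add_bits(acc, partial)
    return acc


def to_bits(value, l):
    return [(value >> i) & 1 for i in range(l)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


print(from_bits(binary_multiply(to_bits(13, 4), to_bits(11, 4))))
```

Each of the $l$ partial products costs $l$ AND-operations plus one $2l$-bit addition, which is why this operation dominates the running time when performed on big-integer ciphertexts.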

Similarly, we begin from the real full-subtractor circuit and then explain the implementations of the binary subtraction 
$\boxminus$ and binary division operations in the ciphertext field. 

Binary subtraction $\boxminus$. The logical operations of a real full-subtractor shown in Figure 6(b) can be expressed as 
follows: 

$$diff = a \oplus b \oplus borrow_{in},$$
$$borrow_{out} = (\neg a \wedge b) \vee (borrow_{in} \wedge \neg(a \oplus b)) = ((1 \oplus a) \wedge b) \vee (borrow_{in} \wedge (1 \oplus (a \oplus b))), \tag{14}$$

where the NOT-operation on a bit $a$ is achieved by computing the value of $(1 \oplus a)$. Let 

$$t_1 = (1 \oplus a) \wedge b,$$
$$t_2 = borrow_{in} \wedge (1 \oplus (a \oplus b)). \tag{15}$$

Based on the truth tables in Figure 5, $borrow_{out}$ can also be expressed in terms of XOR- and AND-logic: 

$$borrow_{out} = (t_1 \oplus t_2) \oplus (t_1 \wedge t_2). \tag{16}$$



Accordingly, a virtual full-subtractor in the ciphertext field is derived in Equation (17) (from Equations (14), (15) and 
(16)), again by substituting the XOR- and AND-logic with addition and multiplication, respectively. 

$$c_{diff} = c_a + c_b + c_{borrow_{in}},$$
$$c_{borrow_{out}} = (c_{t_1} + c_{t_2}) + (c_{t_1} \times c_{t_2}), \tag{17}$$

where 

$$c_{t_1} = (c_1 + c_a) \times c_b,$$
$$c_{t_2} = c_{borrow_{in}} \times (c_1 + (c_a + c_b)),$$

and $c_1 = Enc_{pk}(1)$. 

TABLE I 

The execution time of FHE encryption, FHE re-encryption and CipherTable look-up w.r.t. various security levels. 

Security Level    FHE Encryption    FHE Re-Encryption    CipherTable Look-up 
(dimension)       (sec)             (sec)                (sec) 
512               0.217             6.3                  0.020 
2048              2.084             34.1                 0.254 
8192              20.300            185                  2.397 


By regarding a virtual full-subtractor as a basic unit, we can thus implement a binary subtraction function $\boxminus$ in a way similar 
to building the binary addition $\boxplus$, where each function takes the output cipher $c_{borrow_{out}}$ of the previous full-subtractor 
function as its input borrow-in cipher $c_{borrow_{in}}$. 
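
The full-subtractor logic of Equations (14) to (16) and its chaining can be checked on plaintext bits with the Python sketch below, where `^` and `&` stand in for the ciphertext addition and multiplication of Equation (17). Function names and the least-significant-bit-first ordering are assumptions.

```python
def full_subtractor(a, b, borrow_in):
    """Equations (14)-(16): a - b - borrow_in on one bit, with XOR and AND only."""
    diff = a ^ b ^ borrow_in
    t1 = (1 ^ a) & b                        # NOT a AND b
    t2 = borrow_in & (1 ^ (a ^ b))          # borrow_in AND NOT (a XOR b)
    borrow_out = (t1 ^ t2) ^ (t1 & t2)      # OR expressed with XOR and AND
    return diff, borrow_out


def binary_subtract(x_bits, y_bits):
    """Chain full subtractors over equal-length bit lists (least significant first)."""
    result, borrow = [], 0
    for a, b in zip(x_bits, y_bits):
        d, borrow = full_subtractor(a, b, borrow)
        result.append(d)
    return result, borrow


def to_bits(value, l):
    return [(value >> i) & 1 for i in range(l)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


diff_bits, borrow = binary_subtract(to_bits(5, 4), to_bits(9, 4))
print(from_bits(diff_bits), borrow)   # 5 - 9 wraps around modulo 16, borrow set
```

The final borrow bit is 1 exactly when the subtrahend is larger than the minuend, a fact that the division procedure relies on.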

Binary division $\boxed{\div}$. The binary division function can also be carried out by utilizing a sequence of binary 
subtraction functions $\boxminus$. Similar to decimal division, the binary division function computes the quotient cipher 
string by iteratively subtracting the divisor from the dividend using the binary subtraction function $\boxminus$ and shifting the 
divisor to the right after each binary subtraction $\boxminus$. Each ciphertext in the quotient cipher string is determined by 
adding the output borrow cipher $c_{borrow_{out}}$ of the binary subtraction $\boxminus$ to a cipher $c_1 = Enc_{pk}(1)$, which is 
equivalent to the NOT-operation $(1 \oplus borrow_{out})$ in the plaintext field. 
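
A plaintext sketch of such a division is shown below for unsigned integers. Two points are assumptions rather than statements of the text: the sketch uses a restoring scheme, and it keeps or discards each subtraction result with a branch-free selection `(q & diff) ^ ((1 ^ q) & rem)`, which is one way to avoid inspecting the quotient bit and would be needed on ciphertexts. All names are hypothetical, and the divisor is assumed to be non-zero.

```python
def sub_bits(x, y):
    """Ripple-borrow subtraction of little-endian bit lists; returns (diff, borrow)."""
    out, borrow = [], 0
    for a, b in zip(x, y):
        out.append(a ^ b ^ borrow)
        t1 = (1 ^ a) & b
        t2 = borrow & (1 ^ (a ^ b))
        borrow = (t1 ^ t2) ^ (t1 & t2)
    return out, borrow


def binary_divide(dividend, divisor):
    """Restoring division of l-bit little-endian lists: (quotient, remainder)."""
    l = len(dividend)
    rem = dividend + [0] * l                   # work on 2l bits
    quotient = [0] * l
    for k in range(l - 1, -1, -1):             # divisor moves right each round
        shifted = [0] * k + divisor + [0] * (l - k)
        diff, borrow = sub_bits(rem, shifted)
        q = 1 ^ borrow                         # quotient bit = NOT borrow_out
        # Branch-free choice: keep diff when q = 1, keep rem when q = 0.
        rem = [(q & d) ^ ((1 ^ q) & r) for d, r in zip(diff, rem)]
        quotient[k] = q
    return quotient, rem[:l]


def to_bits(value, l):
    return [(value >> i) & 1 for i in range(l)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


q_bits, r_bits = binary_divide(to_bits(13, 4), to_bits(3, 4))
print(from_bits(q_bits), from_bits(r_bits))
```

For the fractional scores of PP-SimRank the same loop would simply continue for additional rounds below the binary point; the integer version is used here only to keep the check simple.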

B. Efficiency improvement 

FHE encrypts bits into ciphertexts (with random noise) in the form of very big integers to achieve semantic 
security. A side effect, however, is that operations on big integers usually cost a lot of time. In this subsection, we 
propose to reduce the execution time of PP-SimRank in three ways: reducing the number of binary multiplications 
in the score calculations, replacing the FHE encryption with look-up, and carrying out the FHE re-encryption 
mechanism by a Re-encryption protocol instead. 

Reducing the number of binary multiplications. Equations (10) and (11) give the detailed calculations of
SimRank similarity in the ciphertext field. We note that the execution efficiency can be improved by replacing the
binary multiplication in (𝔖(w_x) × π) with simple multiplications ×. Specifically, recall that, as mentioned in
Section III-C, a ciphertext π represents a bit e in the plaintext, and each SimRank score 𝔖(w_x) in the ciphertext
field is represented in the form of a cipher string corresponding to the binary representation (a bit string) of S(w_x)
in plaintext. Because e is either 0 or 1, multiplying e with the bit string of S(w_x) can be carried out by
AND-operations between e and each bit. Correspondingly, in the ciphertext field, we can compute (𝔖(w_x) × π) in
Equations (10) and (11) by multiplying π (corresponding to the AND-operation in the plaintext field) with every
ciphertext in the cipher string of 𝔖(w_x), instead of performing the time-consuming binary multiplication.
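
The observation can be checked on plaintext bits with a short sketch. The helper names and the 8-bit example value are illustrative assumptions; in the protocol each `&` would be one simple multiplication on ciphertexts.

```python
def mask_score(score_bits, e):
    """Multiply a bit string by a single bit e: one AND per bit, no carries."""
    return [bit & e for bit in score_bits]

def from_bits(bits):
    """Interpret a least-significant-bit-first list as an integer."""
    return sum(bit << i for i, bit in enumerate(bits))

# The integer 13 written with l = 8 bits, least significant bit first.
score = [1, 0, 1, 1, 0, 0, 0, 0]
kept = mask_score(score, 1)      # e = 1 keeps the score
dropped = mask_score(score, 0)   # e = 0 zeroes it out
```

The masked string equals the product of the score and e, obtained with l AND-operations and without any carry propagation.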

CipherTable look-up. Table I shows the FHE encryption time of one bit with respect to different security levels¹.
Accordingly, it would be infeasible for Alice and Bob to encrypt all their data by calling the FHE encryption
function.

¹In [3], lattices of 512, 2048 and 8192 dimensions are used to provide security levels called toy, small and medium.

To solve this problem, we use CipherTable look-up instead, which is at least 8 times faster than FHE encryption
according to the results in Table I. That is, since there are only two possible inputs, 0 and 1, to FHE, Alice and
Bob can share the public key pk in advance and each prepare a CipherTable containing a set of ciphertexts
w.r.t. the plaintexts {0, 1} before they actually start computing the similarity measures on their joint data. Therefore,
when PP-SimRank is performed, Alice and Bob can encrypt their data in step 2 of Figure 3 by simply choosing a
corresponding ciphertext from their own CipherTable, instead of calling the FHE encryption function. As FHE outputs
different ciphertexts for the same bit (by introducing random noise in the encryption) and the CipherTable is built
by each party itself, this replacement is satisfactory.
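
The look-up idea can be sketched as follows. The sketch assumes a toy integer-based public-key scheme with tiny, insecure parameters as a stand-in for the lattice-based FHE of [3]; `keygen`, `encrypt`, `build_cipher_table`, `lookup_encrypt` and the `copies` parameter are hypothetical names, and the text does not state how many ciphertexts a CipherTable holds.

```python
import random

def keygen(rng, pk_size=16):
    """Toy key pair (NOT the scheme of [3]); parameters are far too small."""
    p = rng.getrandbits(256) | (1 << 255) | 1          # secret odd integer
    # Public key: encryptions of 0 of the form p*q + 2*r with small noise r.
    pk = [p * rng.getrandbits(512) + 2 * rng.getrandbits(8)
          for _ in range(pk_size)]
    return pk, p

def encrypt(pk, bit, rng):
    """Fresh ciphertext: bit + small even noise + random subset sum of pk."""
    subset = [x for x in pk if rng.random() < 0.5]
    return bit + 2 * rng.getrandbits(8) + sum(subset)

def decrypt(sk, cipher):
    # The noise stays far below sk here, so its parity reveals the bit.
    return (cipher % sk) % 2

def build_cipher_table(pk, rng, copies=32):
    """Precompute `copies` ciphertexts for each plaintext bit, offline."""
    return {bit: [encrypt(pk, bit, rng) for _ in range(copies)]
            for bit in (0, 1)}

def lookup_encrypt(table, bit, rng):
    """Encrypt by picking a precomputed ciphertext instead of encrypting."""
    return rng.choice(table[bit])
```

The expensive `encrypt` calls happen once while the table is built; during the protocol each encryption becomes a list access.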

Re-encryption protocol. The PP-SimRank protocol protects the link information using FHE. For a given bit
representing direct connection or disconnection, FHE introduces random noise into the encryption so that the same
bit yields different ciphertexts. Due to the random noise, the inverse mapping from ciphers to the bit becomes
challenging. However, the noise in a ciphertext propagates and grows with the homomorphic operations performed on it.
When the magnitude of the noise reaches a certain limit², the ciphertext can no longer be correctly decrypted. A
re-encryption mechanism that can refresh the ciphertext by reducing the associated noise is thus required.

The work in [3] has suggested a mechanism, called bootstrapping, for re-encryption. The idea of bootstrapping
is to refresh a ciphertext by decrypting the ciphertext homomorphically with the encrypted secret key. As shown in
Table I, however, re-encryption by bootstrapping is very time-consuming at every security level. Instead,
we propose a more efficient re-encryption protocol for our problem scenario. The results in Table I show that
our re-encryption protocol runs at least 77 times faster than the FHE re-encryption method.

Specifically, when Bob detects³ that the magnitude of the noise associated with a ciphertext reaches the limit
during the computation, Bob asks for Alice's help with re-encryption. The noisy ciphertext is sent to Alice.
Alice then decrypts the noisy ciphertext using the secret key sk (which is known only to Alice), and encrypts the
resulting bit again using the same public key pk. Here, as mentioned previously, Alice can perform the encryption
by looking up her CipherTable, rather than calling the FHE encryption function. The re-encryption efficiency is thus
improved significantly. After that, the corresponding new ciphertext is sent back to Bob for subsequent computation.

This re-encryption protocol does not weaken the protection of private links. For Bob, what he sends out and receives
are both ciphertexts. Therefore, he cannot get additional information about the links in Alice's data. On the other
hand, although Alice learns the true value of the received bit by decryption, she does not know the semantic
meaning of the bit, since the bit can be any link information in Bob's data, any bit in the binary representation of a
similarity score between any two vertices, or even a temporary bit appearing only during the computation process.
Consequently, the re-encryption protocol is efficient and satisfactory for our problem scenario.
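
The message flow of the re-encryption protocol can be sketched as below. This is an illustration under stated assumptions: a toy integer-based scheme replaces the FHE of [3], Bob tracks a public upper bound on the noise size instead of running the noise test of [3], `NOISE_LIMIT` is a hypothetical threshold, and Alice re-encrypts by calling the toy `encrypt` directly rather than by CipherTable look-up.

```python
import random
from dataclasses import dataclass

SECRET_BITS = 256
NOISE_LIMIT = 120          # hypothetical refresh threshold, in bits

@dataclass
class Cipher:
    value: int
    noise_bits: int        # publicly trackable upper bound on the noise size

class Alice:
    """Cryptographer: holds the secret key and answers refresh requests."""
    def __init__(self, rng):
        self.rng = rng
        self.sk = rng.getrandbits(SECRET_BITS) | (1 << (SECRET_BITS - 1)) | 1
        self.refreshes = 0

    def encrypt(self, bit):
        noise = bit + 2 * self.rng.getrandbits(8)     # parity carries the bit
        return Cipher(self.sk * self.rng.getrandbits(512) + noise, 10)

    def decrypt(self, cipher):
        return (cipher.value % self.sk) % 2

    def re_encrypt(self, cipher):
        # Decrypt the noisy ciphertext and return a fresh, low-noise one.
        self.refreshes += 1
        return self.encrypt(self.decrypt(cipher))

def h_add(x, y):
    """Homomorphic XOR: noise grows by at most one bit."""
    return Cipher(x.value + y.value, max(x.noise_bits, y.noise_bits) + 1)

def h_mul(x, y):
    """Homomorphic AND: noise bit lengths add up."""
    return Cipher(x.value * y.value, x.noise_bits + y.noise_bits)

def refresh_if_needed(cipher, alice):
    """Bob's side: ask Alice for a refresh once the bound passes the limit."""
    if cipher.noise_bits > NOISE_LIMIT:
        return alice.re_encrypt(cipher)
    return cipher

def encrypted_and(ciphers, alice):
    """Bob ANDs a list of encrypted bits, refreshing whenever required."""
    acc = ciphers[0]
    for c in ciphers[1:]:
        acc = refresh_if_needed(h_mul(acc, c), alice)
    return acc
```

Bob only ever handles `Cipher` objects, while Alice sees isolated bits without knowing which link or intermediate value they belong to.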

²Given the public key, FHE [3] allows one to test the magnitude of noise in a ciphertext. If the noise grows beyond a
certain ratio of the decryption radius, re-encryption is required.
³Refer to Footnote 2.
V. Performance analysis

A. Security of PP-SimRank 

The purpose of PP-SimRank is to achieve SimRank similarity score co-computation based on a joint virtual
network G = (V^A ∪ V^B, U, E^A ∪ E^B) while protecting all link information in both E^A and E^B from being
stolen. In the analysis, we ask two questions: (1) Can Bob, who holds the whole (virtual) network, derive the links
in Alice's data by generating many plaintext-ciphertext pairs with the given public key pk? (2) Can Alice or Bob
derive the links in the other's data after the similarities between nodes in U are revealed?

The first question is also known as a birthday attack [8]. Since, in PP-SimRank, the public key pk generated
by Alice is shared with Bob, Bob is able to generate plaintext-ciphertext pairs by himself. The pairs
he generates may contain a cipher that collides with an encrypted datum received from Alice. The link information
in E^A, therefore, has a chance of being stolen by Bob. However, according to [3], the ciphertext space of
FHE is enormously large, i.e., a ciphertext c ∈ {1, 2, ..., d} where log_2 d ∈ {195764, 785006, 3148249} w.r.t.
the security levels toy, small and medium. Under the analysis of the birthday problem [8], the expected number of
ciphertexts Bob has to generate before he finds the first collision is about sqrt(πd/2), which is significantly large.
Hence, the probability that Bob finds a collision and steals the link information from Alice is negligibly small.
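
To give a sense of scale, the standard birthday-problem approximation sqrt(πd/2) for the expected number of draws before a first collision can be evaluated in the log domain. The sketch below assumes ciphertexts generated by Bob are uniformly distributed over the ciphertext space; the function name is hypothetical.

```python
import math

def log2_expected_trials(log2_d):
    """log2 of sqrt(pi * d / 2), computed without forming d itself."""
    return 0.5 * (log2_d + math.log2(math.pi / 2))

levels = {"toy": 195764, "small": 785006, "medium": 3148249}
exponents = {name: log2_expected_trials(bits) for name, bits in levels.items()}
for name, exponent in exponents.items():
    print(f"{name}: about 2^{exponent:.1f} ciphertexts")
```

Even at the toy level the exponent is close to 97882, i.e., roughly half of log_2 d, which is far beyond any feasible number of encryptions.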

For the second question, the answer is "No". The reason is that, in addition to the pairwise similarity
scores S(w_y) between nodes in U, a party also needs the pairwise similarity scores S(w_x) between the nodes
in (V^A ∪ V^B) in order to derive the link information of the other party by solving linear algebra problems.
However, at the end of the PP-SimRank protocol, the similarity scores S(w_x) between the nodes in (V^A ∪ V^B) are
available to neither Alice nor Bob, since Alice does not have the data and Bob does not have the secret key for
decryption. Consequently, in a semi-honest security model, PP-SimRank does not give a party a chance to steal
the link information of the other party.

B. Communication and computation complexity 

We illustrate the communication and computational cost of PP-SimRank on a joint network G = (V^A ∪ V^B, U,
E^A ∪ E^B) in this section. Assume there are m nodes in (V^A ∪ V^B), n nodes in U, and the dimension
of the node-pair graph matrix M is C^m_2 × C^n_2, i.e., O(m^2) × O(n^2). Let l be the number of bits used for the
binary representation of a SimRank score in plaintext, and k be the number of bits for a ciphertext transmission.
First, we analyze the computational cost on the calculator, Bob. To construct the virtual node-pair graph in step 3
of PP-SimRank, Bob computes each element π_{x,y} in M based on Equation (7), which consists of two additions + and
three multiplications ×. Therefore, Bob performs O(m^2 n^2) additions and multiplications in total to fill the whole
graph matrix. Next, in step 4 of PP-SimRank, Bob performs the SimRank score computation in the ciphertext field
according to Equations (10) and (11). For each score 𝔖(w_x) (or 𝔖(w_y)), the required arithmetic operations on the
cipher string include O(1) binary division, O(n^2) (or O(m^2)) binary additions and O(n^2 + 1) (or O(m^2 + 1)) binary
multiplications, because the numbers of node pairs in (V^A ∪ V^B) and in U are O(m^2) and O(n^2), respectively. We can
further improve the performance by replacing the binary multiplication in (𝔖(w) × π) in Equations (10) and (11)
with l simple multiplications × as explained in Section IV-B. The complexity of computing a score 𝔖(w_x) (or 𝔖(w_y))
therefore becomes O(1) binary division, O(n^2) (or O(m^2)) binary additions, O(n^2 l) (or O(m^2 l)) simple
multiplications and O(1) binary multiplication. The total arithmetic operations required in one iteration to
compute all the SimRank scores are thus O(m^2 + n^2) binary divisions, O(m^2 + n^2) binary multiplications, O(m^2 n^2)
binary additions and O(m^2 n^2 l) simple multiplications. Table II shows the execution efficiency of each (binary)
arithmetic operation in the ciphertext field.

TABLE II

The execution time (sec) of (binary) arithmetic operations in ciphertext field.

Arithmetic Operation    Security Level (dimension)
                        512             2048            8192
simple +                1.0 × 10^-5     3.0 × 10^-5     1.4 × 10^-?
simple ×                4.67 × 10^-?    0.03            0.135
binary +                0.81            5.58            54.92
binary −                1.11            8.61            92.7
binary ×                5.30            47.06           584.00
binary ÷                22.39           191.94          1961.00

TABLE III

The numbers of re-encryptions involved during various operations.

Arithmetic Operation                            Num. of Re-encryptions
                                                max     Avg.      min
simple +, simple ×, binary +, binary −
binary ×                                        40      36.04     33
binary ÷                                        280     224.31    191
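
The per-iteration operation counts can be tallied with a small helper. This is a rough sketch that takes the leading terms of the bounds above with all hidden constants set to 1, which is an assumption; the function name and the example sizes are hypothetical.

```python
from math import comb

def ops_per_iteration(m, n, l):
    """Leading-order operation counts for one SimRank iteration."""
    pairs_v = comb(m, 2)     # node pairs w_x in the V side
    pairs_u = comb(n, 2)     # node pairs w_y in U
    scores = pairs_v + pairs_u
    # Every score sums over all node pairs on the opposite side.
    terms = 2 * pairs_v * pairs_u
    return {
        "binary_divisions": scores,
        "binary_multiplications": scores,
        "binary_additions": terms,
        "simple_multiplications": terms * l,
    }

counts = ops_per_iteration(m=4, n=3, l=8)
```

Multiplying each count by the matching timing in Table II gives an order-of-magnitude estimate of the time per iteration.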

In PP-SimRank, the communication between Alice and Bob occurs in three steps: the encrypted link data
transmission after step 2, the transmission of the resulting cipher SimRank scores after step 5, and the
re-encryption protocol required during computation. First, assume that Alice has |V^A| nodes that are exclusively
owned by herself. She thus needs n|V^A| bits in total to encode all the link information between V^A and U, and
transmits O(kn|V^A|) bits of data to Bob for subsequent calculations. Second, after completing the SimRank score
computation, Bob sends O(n^2) noisy cipher SimRank scores 𝔖'(w_y) = 𝔖(w_y) + r_y to Alice after step 5. Since each
cipher score is represented by l ciphers and each cipher needs k bits for transmission, the communication cost is
O(n^2 l k) bits. Finally, as shown in Table III, the number of re-encryptions required varies with the binary
arithmetic operation and the ciphertexts involved. Here we evaluate the communication cost of one run of the
re-encryption protocol: Alice and Bob send one cipher to each other. Therefore, the cost is O(k) bits.
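
The three communication terms can be put into numbers with a short sketch. It assumes all hidden constants are 1 and, in the example, takes k to be log_2 d of the toy security level (195764 bits), which is an assumption about the ciphertext encoding; the function name and the example sizes are hypothetical.

```python
from math import comb

def communication_bits(n, v_alice, l, k, re_encryptions):
    """Bits exchanged between Alice and Bob in the two-party protocol."""
    return {
        # One ciphertext per potential link between V^A and U.
        "link_data": k * n * v_alice,
        # One l-cipher string per node pair in U.
        "scores": comb(n, 2) * l * k,
        # Each re-encryption sends one cipher in each direction.
        "re_encryption": 2 * k * re_encryptions,
    }

example = communication_bits(n=100, v_alice=50, l=16, k=195764, re_encryptions=40)
```

With these example sizes the score transmission dominates, at about 1.55 × 10^10 bits.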
VI. Extension to multi-party scenario

In this section, we further extend the PP-SimRank protocol to a more general case, i.e., the multi-party scenario.
We assume there are s parties in total and the i-th party holds an information network G^i = (V^i, U, E^i), where the
vertex sets V^1, V^2, ..., and V^s are mutually disjoint. The purpose is to let all parties learn the similarity
between nodes in U based on their joint network G = (∪_i V^i, U, ∪_i E^i) while the links held by each party remain
known only to itself.

Similar to the two-party scenario, here PP-SimRank also asks two of the s parties to act as a cryptographer
(Alice) and a calculator (Bob) and to perform their respective duties. The remaining parties encrypt their data
with the public key pk received from the cryptographer, and then transmit the encrypted data to the calculator
after step 2 of the PP-SimRank protocol (shown in Figure 3). In addition, each of the remaining parties also
determines a random number list R^i containing C^n_2 elements, encrypts each element r^i_y, y ∈ {1, ..., C^n_2}, with
pk, and sends the encrypted random numbers to the calculator. After the calculation of SimRank similarity is
completed in step 4, the calculator adds to each of the resulting SimRank scores 𝔖(w_y) a set of random numbers
r^1_y, r^2_y, ..., r^{s−1}_y collected from all parties except the cryptographer. Therefore, at the end, the
cryptographer will hold the noisy scores S'(w_y) after decryption, and each of the other parties will have a list of
random numbers that have been added to the noisy scores S'(w_y). The correct SimRank scores can then be derived after
all parties exchange this information with each other.
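
The masking and unmasking of one score can be sketched on plaintext values as follows. Encryption is omitted, and the wrap-around modulo 2^l for l-bit scores is an assumption of this sketch rather than something stated in the protocol; the function name is hypothetical.

```python
import random

def masked_score_round(score, s, l, rng):
    """Simulate masking one l-bit score among s parties (plaintext view)."""
    modulus = 1 << l
    # Every party except the cryptographer contributes one random number.
    masks = [rng.randrange(modulus) for _ in range(s - 1)]
    # The calculator adds all masks to the score (under encryption in
    # the real protocol).
    noisy = score
    for r in masks:
        noisy = (noisy + r) % modulus
    # The cryptographer decrypts `noisy`; once the parties exchange their
    # masks, the true score is recovered by subtracting them.
    recovered = (noisy - sum(masks)) % modulus
    return noisy, masks, recovered

noisy, masks, recovered = masked_score_round(score=37, s=4, l=8,
                                             rng=random.Random(7))
```

The cryptographer alone sees only `noisy`, and each other party alone sees only its own mask, so the score becomes available only after the exchange.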

Now consider the corresponding complexities. Assume there are m nodes in total in (V^1 ∪ V^2 ∪ ... ∪ V^s), i.e.,
m = |V^1| + |V^2| + ... + |V^s|, and n nodes in U, i.e., n = |U|. The dimension of the node-pair graph matrix
M is therefore O(m^2) × O(n^2), which is the same as the dimension in the two-party scenario. As a result,
the computational complexity of multi-party PP-SimRank is identical to the one we have analyzed in the two-party
scenario. On the other hand, the communication cost slightly increases. Assume that each score uses l bits for its
binary representation and each ciphertext needs k bits for its transmission. There is a new cost of O((s − 2)n^2 l k)
bits for transmitting (s − 2) encrypted lists of random numbers from all parties, except the cryptographer and the
calculator, to the calculator, while the cost of the encrypted link data transmission becomes O((m − |V^B|)nk) bits.
The cost of transmitting the resulting cipher SimRank scores remains the same.

VII. Conclusion 

In this paper, we have addressed the problem of privacy-preserving co-computation of a link-based similarity
measure in a distributed bipartite network, and proposed a new privacy-preserving protocol, PP-SimRank, as a
solution. PP-SimRank strictly protects each party's link information and reveals nothing but the desired similarity
measures on the (virtual) joint network. We also implemented the basic binary arithmetic operations in the ciphertext
field to securely compute SimRank scores under fully-homomorphic encryption (FHE) and demonstrated the
potential of FHE in privacy-preserving data mining.